Skip to content

feat(egress): wrap with supervisor + cleanup hook#951

Merged
hittyt merged 4 commits into
opensandbox-group:mainfrom
Pangjiping:feat/egress-supervisor
Jun 3, 2026
Merged

feat(egress): wrap with supervisor + cleanup hook#951
hittyt merged 4 commits into
opensandbox-group:mainfrom
Pangjiping:feat/egress-supervisor

Conversation

@Pangjiping
Copy link
Copy Markdown
Collaborator

Summary

Hard-crashed egress leaves stale iptables/nft rules and a zombie mitmdump holding port 18081; restarting the container then accumulates duplicate rules and the new mitmdump fails to bind, sending the in-process mitm watchdog (PR #942) into a retry loop. This change keeps the egress process under a dedicated supervisor so restarts are deterministic and the dirty state is reset on every launch and exit.

components/internal/supervisor: new shared single-worker supervisor. Exponential backoff with jitter, pre-start / post-exit hooks (failures non-fatal), crashloop circuit breaker, JSONL event log. SIGTERM is forwarded to the worker with a configurable grace period before SIGKILL. Includes unit + integration tests using a re-exec'd test binary as a fake child.

components/internal/cmd/supervisor: opensandbox-supervisor binary built from the same module; flag-driven, no new external deps.

components/egress/scripts/cleanup.sh: best-effort, idempotent reset of the iptables DNS REDIRECT rules, transparent-HTTP rules, the nftables opensandbox table, and stray mitmdump processes. Hard contract: never exit non-zero so a misbehaving cleanup cannot block restarts.

components/egress/Dockerfile: builds and installs the supervisor and the cleanup script alongside the egress binary under /opt/opensandbox-egress/, then switches the ENTRYPOINT to run the supervisor with cleanup as both pre-start and post-exit, grace 20s to cover the egress-internal shutdown budget.

Testing

  • Not run (explain why)
  • Unit tests
  • Integration tests
  • e2e / manual verification

Breaking Changes

  • None
  • Yes (describe impact and migration path)

Checklist

  • Linked Issue or clearly described motivation
  • Added/updated docs (if needed)
  • Added/updated tests (if needed)
  • Security impact considered
  • Backward compatibility considered

Pangjiping and others added 3 commits May 28, 2026 13:34
Hard-crashed egress leaves stale iptables/nft rules and a zombie mitmdump
holding port 18081; restarting the container then accumulates duplicate
rules and the new mitmdump fails to bind, sending the in-process mitm
watchdog (PR opensandbox-group#942) into a retry loop. This change keeps the egress
process under a dedicated supervisor so restarts are deterministic and
the dirty state is reset on every launch and exit.

components/internal/supervisor: new shared single-worker supervisor.
Exponential backoff with jitter, pre-start / post-exit hooks (failures
non-fatal), crashloop circuit breaker, JSONL event log. SIGTERM is
forwarded to the worker with a configurable grace period before SIGKILL.
Includes unit + integration tests using a re-exec'd test binary as a
fake child.

components/internal/cmd/supervisor: opensandbox-supervisor binary built
from the same module; flag-driven, no new external deps.

components/egress/scripts/cleanup.sh: best-effort, idempotent reset of
the iptables DNS REDIRECT rules, transparent-HTTP rules, the
nftables `opensandbox` table, and stray mitmdump processes. Hard
contract: never exit non-zero so a misbehaving cleanup cannot block
restarts.

components/egress/Dockerfile: builds and installs the supervisor and
the cleanup script alongside the egress binary under
/opt/opensandbox-egress/, then switches the ENTRYPOINT to run the
supervisor with cleanup as both pre-start and post-exit, grace 20s to
cover the egress-internal shutdown budget.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Self-review fixes:

- BackoffJitter is now *float64 so callers can pass &zero to disable
  jitter explicitly. The previous default override turned 0 into 0.1,
  making "no jitter" impossible. cmd/supervisor exposes the value via
  --backoff-jitter (default 0.1).
- Build hookEnv into a fresh slice instead of `append(spec.Env, ...)`,
  which could write into spec.Env's underlying array when cap > len.
- Hoist delete_until_gone to file scope in cleanup.sh; remove the two
  inline duplicates.
- Add cmd/supervisor argv tests: splitOnDoubleDash table cases, toHooks,
  openEventLog stderr + dir creation, eventLogDest label.
- Add backoff tests covering jitter=0 and applyDefaults() pointer
  semantics.
- Document signal handling in the package doc: SIGINT/SIGTERM trigger
  shutdown via ctx; SIGHUP / SIGUSR1 / SIGUSR2 / SIGWINCH / SIGQUIT are
  NOT forwarded.
- Remove dead fakeClock.advance helper.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@Pangjiping Pangjiping added feature New feature or request component/egress labels May 28, 2026
@Pangjiping Pangjiping marked this pull request as ready for review May 28, 2026 06:16
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8b82b6ee00

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread components/egress/Dockerfile
Comment thread components/egress/Dockerfile Outdated
PR review (opensandbox-group#951, Codex P1) caught a real security
regression: the post-exit cleanup hook was tearing down the iptables DNS
REDIRECT rules and the `inet opensandbox` nft table before the supervisor
slept for backoff and relaunched egress. Because the egress sidecar shares
its network namespace with the workload it is meant to filter, that window
left the workload with unfiltered egress instead of the stale default-deny
rules continuing to protect it. With a worst-case crashloop budget of
10 launches over 5 minutes, that window can stretch to minutes.

The fix is to leave netfilter state alone between runs:

- Drop the post-exit hook entirely. The backoff window now keeps the
  previous run's enforcement rules in place.
- Slim cleanup.sh to mitmdump-reaping only. iptables rule accumulation
  across many restarts is a slower-burn drift that egress's own
  SetupRedirect tolerates (first match wins); the nftables manager
  already prepends `delete table` to its ruleset script, so ApplyStatic
  is idempotent. Neither needs hook intervention.
- Keep the pre-start mitmdump kill so the new egress can bind the
  transparent-MITM listen port without colliding with an orphan.

(Codex P2 — zombie reaping when supervisor is PID 1 — is intentionally
not addressed in this commit; it does not gate the security fix.)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator

@hittyt hittyt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@hittyt hittyt merged commit 2a91287 into opensandbox-group:main Jun 3, 2026
15 checks passed
Pangjiping added a commit to Pangjiping/OpenSandbox that referenced this pull request Jun 4, 2026
…ibility

PR opensandbox-group#951 moved the egress binary from /egress to /opt/opensandbox-egress/egress
so the supervisor and binary could share a single grouped directory. External
tooling and older deployment manifests may still reference the old /egress
path; add a symlink so both paths resolve to the same binary.

Symlink rather than COPY: zero extra image size, single source of truth for
chmod and replacement, and `exec /egress` resolves to the supervisor-managed
binary like before.
Pangjiping added a commit to Pangjiping/OpenSandbox that referenced this pull request Jun 4, 2026
…ibility

PR opensandbox-group#951 moved the egress binary from /egress to /opt/opensandbox-egress/egress
so the supervisor and binary could share a single grouped directory. External
tooling and older deployment manifests may still reference the old /egress
path; add a symlink so both paths resolve to the same binary.

Symlink rather than COPY: zero extra image size, single source of truth for
chmod and replacement, and `exec /egress` resolves to the supervisor-managed
binary like before.
Pangjiping added a commit to Pangjiping/OpenSandbox that referenced this pull request Jun 4, 2026
…ibility

PR opensandbox-group#951 moved the egress binary from /egress to /opt/opensandbox-egress/egress
so the supervisor and binary could share a single grouped directory. External
tooling and older deployment manifests may still reference the old /egress
path; add a symlink so both paths resolve to the same binary.

Symlink rather than COPY: zero extra image size, single source of truth for
chmod and replacement, and `exec /egress` resolves to the supervisor-managed
binary like before.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

component/egress feature New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants